ReBNN: Resilient Binary Neural Network
FIGURE 3.29
The evolution of the latent weight distribution of (a) ReActNet and (b) ReBNN. We select the first channel of the first binary convolution layer to show the evolution. The model is initialized from the first-stage training with W32A1 following [158]. We plot the distribution every 32 epochs.
sign flip, thus hindering the training. Inspired by this, we use Eq. (3.150) to calculate γ, which improves performance by 0.6% and shows that accounting for the proportion of weight oscillation permits the necessary sign flips and leads to more effective training. We also show the training loss curves in Fig. 3.30(b). As plotted, the loss curves largely reflect how sufficiently each model is trained. We therefore conclude that ReBNN with γ calculated by Eq. (3.150) achieves the lowest training loss and an efficient training process. Note that the loss may not be minimal at every training iteration; still, our method is a reasonable variant of gradient descent that can be applied to the optimization problem in its general form. We empirically verify ReBNN's capability of mitigating weight oscillation, leading to better convergence.
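To make the role of γ concrete, the following is a minimal sketch of one plausible channel-wise computation. It assumes, as the alternatives in Table 3.7 suggest, that Eq. (3.150) scales the peak latent-weight gradient magnitude max_{1≤j≤M_n}(|∂L/∂ŵ^{n,t}_{i,j}|) by the observed proportion of sign flips; the helper names and the exact combination are illustrative assumptions, not the book's definition.

```python
import torch

def oscillation_ratio(w_prev: torch.Tensor, w_curr: torch.Tensor) -> torch.Tensor:
    """Per-output-channel fraction of latent weights whose binary sign
    flipped between two consecutive iterations (an assumed helper)."""
    flips = (torch.sign(w_prev) != torch.sign(w_curr)).float()
    return flips.flatten(1).mean(dim=1)  # one ratio per output channel

def compute_gamma(w_prev, w_curr, grad):
    """Hypothetical stand-in for Eq. (3.150): damp each channel in
    proportion to how much it oscillates, using the per-channel peak
    gradient magnitude (the gradient-based baseline in Table 3.7) as
    the scale."""
    max_grad = grad.abs().flatten(1).amax(dim=1)         # per-channel max |grad|
    return oscillation_ratio(w_prev, w_curr) * max_grad  # gamma per channel
```

Under this reading, a channel whose weights never flip receives γ = 0 and remains free to move, while a heavily oscillating channel is damped, consistent with the observation that a fixed large γ suppresses even the necessary sign flips.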
Resilient training process: This section shows the evolution of the latent weight distribution. We plot the distribution of the first channel of the first binary convolution layer every 32 epochs in Fig. 3.29. As seen, our ReBNN efficiently redistributes the latent weights toward resilience. The conventional ReActNet [158] possesses a tri-modal distribution, which is unstable due to scaling factors with large magnitudes. In contrast, our ReBNN is constrained by the balanced parameter γ during training, leading to a resilient bi-modal distribution with fewer weights centered around zero. We also plot the ratios of sequential weight oscillation of ReBNN and ReActNet for the 1st, 8th, and 16th binary convolution layers, which can be tracked as sketched below.
TABLE 3.7
We compare different calculation methods of γ, including constants that vary from 0 to 1e−2 and gradient-based calculation.

Value of γ                              Top-1 (%)   Top-5 (%)
0                                       65.8        86.3
1e−5                                    66.2        86.7
1e−4                                    66.4        86.7
1e−3                                    66.3        86.8
1e−2                                    65.9        86.5
max_{1≤j≤M_n}(|∂L/∂ŵ^{n,t}_{i,j}|)      66.3        86.2
Eq. (3.150)                             66.9        87.1